TIR Project

Search for bacterial TIR domain-containing proteins.

1) Set the working directory.


In [2]:
cd ~/Data/tir_project

2) Download the assembly summary from NCBI FTP.


In [ ]:
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/ASSEMBLY_REPORTS/assembly_summary_genbank.txt

Columns in assembly_summary_genbank.txt (tab-delimited):

Column number  Column name           Example
1              assembly_accession    GCA_000174395.2
2              bioproject            PRJNA30627
3              biosample             SAMN00002237
4              wgs_master
5              refseq_category       reference genome
6              taxid                 333849
7              species_taxid         1352
8              organism_name         Enterococcus faecium DO
9              infraspecific_name    strain=DO
10             isolate
11             version_status        latest
12             assembly_level        Complete Genome
13             release_type          Major
14             genome_rep            Full
15             seq_rel_date          2012/05/25
16             asm_name              ASM17439v2
17             submitter             Baylor College of Medicine
18             gbrs_paired_asm       GCF_000174395.2
19             paired_asm_comp       identical
20             ftp_path              ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/395/GCA_000174395.2_ASM17439v2
21             excluded_from_refseq

3) Download categories.dmp from NCBI, which links each taxon ID to its top-level category (e.g. Bacteria).


In [ ]:
wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxcat.tar.gz
tar -xvzf taxcat.tar.gz
rm taxcat.tar.gz

categories.dmp contains a single line for each node that is at or below the species level in the NCBI Taxonomy database.

The first column is the top-level category:

A = Archaea
B = Bacteria
E = Eukaryota
V = Viruses and Viroids
U = Unclassified
O = Other

The third column is the taxid itself, and the second column is the corresponding species-level taxid.

For example, these nodes in the taxonomy:

242703 - Acidilobus saccharovorans
666510 - Acidilobus saccharovorans 345-15

appear in categories.dmp as:

A 242703 242703
A 242703 666510
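
The column layout described above can be used to pull out the bacterial taxids. The following is a minimal sketch, assuming categories.dmp is tab-separated; the file names categories_sample.dmp and bacteria_taxids_sample.txt are hypothetical, and the bacterial row is an illustrative line built from the Enterococcus faecium DO taxids in the assembly summary example.

```shell
# Tiny stand-in for categories.dmp (assumed tab-separated):
# column 1 = category, column 2 = species-level taxid, column 3 = taxid.
printf 'A\t242703\t242703\nA\t242703\t666510\nB\t1352\t333849\n' > categories_sample.dmp

# Keep the taxid (column 3) of every node whose top-level category is B (Bacteria).
awk 'BEGIN{FS="\t"} $1 == "B" {print $3}' categories_sample.dmp > bacteria_taxids_sample.txt

cat bacteria_taxids_sample.txt
```

Run against the real categories.dmp, the same awk filter would give one taxid per line, which could then be matched against column 6 of the assembly summary.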


In [7]:
wc -l assembly_summary_genbank.txt


   92276 assembly_summary_genbank.txt

4) Extract bacterial assembly records and create a link to the protein file.

The FTP link for the assembly is:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/395/GCA_000174395.2_ASM17439v2

We need to generate the FTP link for the protein file:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/395/GCA_000174395.2_ASM17439v2/GCA_000174395.2_ASM17439v2_protein.faa.gz
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/395/GCA_000174395.2_ASM17439v2/GCF_000174395.2_ASM17439v2_protein.faa.gz

The link to the protein file is composed as: column 20 + "/" + column 1 + "_" + column 16 + "_protein.faa.gz"
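
The rule above can be checked on a single record. This is only a sketch of the string construction, not the contents of extract_bacteria_assemblies.sh (which is not shown here); the file name sample_row.txt and the variable protein_url are hypothetical, and the record reuses the example values from the column table.

```shell
# One sample record with the example values from the column table.
# printf repeats '%s\t' for each argument, so the 20 values below are each
# followed by a tab; the trailing tab leaves column 21 empty.
printf '%s\t' GCA_000174395.2 PRJNA30627 SAMN00002237 '' 'reference genome' 333849 1352 \
  'Enterococcus faecium DO' 'strain=DO' '' latest 'Complete Genome' Major Full 2012/05/25 \
  ASM17439v2 'Baylor College of Medicine' GCF_000174395.2 identical \
  ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/395/GCA_000174395.2_ASM17439v2 > sample_row.txt
printf '\n' >> sample_row.txt

# Compose the protein link: column 20 / column 1 _ column 16 _protein.faa.gz
protein_url=$(awk 'BEGIN{FS="\t"} {print $20 "/" $1 "_" $16 "_protein.faa.gz"}' sample_row.txt)
echo "$protein_url"
```

For this record the result is the GCA protein link listed above.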


In [38]:
PATH=$PATH:~/Data/Notebooks/tir_project
extract_bacteria_assemblies.sh

5) Summarize genomes by RefSeq category (column 5) and assembly level (column 12)


In [14]:
awk 'BEGIN{FS="\t"} {gen[$5"\t"$12]++} END{for (x in gen) {print x"\t"gen[x]}}' bacteria_only_assembly_summary_genbank.txt | sort


na	Chromosome	1044
na	Complete Genome	4711
na	Contig	36884
na	Scaffold	38489
reference genome	Chromosome	2
reference genome	Complete Genome	118
representative genome	Chromosome	110
representative genome	Complete Genome	1468
representative genome	Contig	1761
representative genome	Scaffold	1741

6) Get protein sequences from bacterial assemblies which are reference genomes (column 5). Column 22 of bacteria_only_assembly_summary_genbank.txt is passed to wget as the download link.


In [ ]:
awk ' BEGIN{FS="\t"}($5 == "reference genome"){print $22}' bacteria_only_assembly_summary_genbank.txt | xargs -L 1 wget --quiet -P ~/Data/tir_project/reference_genomes

7) Get protein sequences from bacterial assemblies which are representative genomes (column 5)


In [3]:
rg=($(awk ' BEGIN{FS="\t"}($5 == "representative genome"){print $22}' bacteria_only_assembly_summary_genbank.txt))

In [ ]:
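x=0
err=0
for i in "${rg[@]}"
do
    ((x++))
    echo "$x: $i"
    # Local path the downloaded file should have
    var="representative_genomes/${i##*/}"

    if [[ $i != *".faa.gz" ]]
    then
        # Not a protein file link: log it and count the error
        echo "$i" >> ~/Data/tir_project/representative_genomes_ftp.err
        ((err++))
    else
        # Download only if the file is not already present
        if [ ! -e "$var" ]
        then
            wget -P ~/Data/tir_project/representative_genomes "$i"
        fi
        # Still missing: retry with every GCA in the link replaced by GCF
        if [ ! -e "$var" ]
        then
            j=${i//GCA/GCF}
            wget -P ~/Data/tir_project/representative_genomes "$j"
        fi
    fi
done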
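
The loop above relies on two bash parameter expansions. The short sketch below shows what each one produces for the example protein link from step 4; the variable names file_name and refseq_url are hypothetical and used only for illustration.

```shell
i="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/174/395/GCA_000174395.2_ASM17439v2/GCA_000174395.2_ASM17439v2_protein.faa.gz"

# ${i##*/} removes the longest prefix ending in "/", leaving only the file name.
file_name=${i##*/}

# ${i//GCA/GCF} replaces every occurrence of GCA, in the directory and in the file name.
refseq_url=${i//GCA/GCF}

echo "$file_name"
echo "$refseq_url"
```

Note that the substitution rewrites the directory part of the path as well as the file name, so the fallback download points at the GCF directory tree.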